[57] ABSTRACT 

A timer is periodically reset by a software agent executing 
on a processor. If the timer is not reset within a predeter- 
mined period of time, an interrupt is generated. An interrupt 
handler then periodically resets the timer, and if the timer is 
not reset within an additional predetermined period of time, 
the computer system is partially reset. 

12 Claims, 2 Drawing Sheets 
METHOD AND APPARATUS FOR DETECTING AND RECOVERING FROM
COMPUTER SYSTEM MALFUNCTION

RELATED APPLICATIONS

Reference is made to the following commonly assigned
copending patent applications:

U.S. Ser. No. 08/935,115, entitled "Method and Apparatus
for Detecting and Reporting Failed Microprocessor
Reset"; and

U.S. Ser. No. 08/933,629, entitled "Method and Apparatus
for Reporting Malfunctioning Computer System",

each of which is incorporated by reference herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention pertains to the field of computer
systems. More particularly, this invention pertains to the
field of detecting and recovering from computer system
malfunctions.

2. Background of the Related Art

For many years, computer system manufacturers, computer
component manufacturers, and computer users have
been concerned with detecting and recovering from computer
system malfunctions. There are many reasons why a
computer system might malfunction, including memory data
corruption, data corruption related to fixed disks or removable
media, operating system errors, component errors,
components overheating, applications or operating systems
performing illegal instructions with respect to the processor,
incompatibility between various hardware and software system
components, etc.

Some of these types of malfunctions have been effectively
dealt with by prior systems. For example, memory data
corruption can be handled by parity detection and/or error
correcting code (ECC). Illegal instructions can be trapped by
the processor and in many cases handled either within the
processor or by the operating system. Other malfunctions
may result in system "hangs." A system is "hanged" when it
is no longer able to respond to user inputs. Some malfunctions
that can result in system hangs include operating
systems or hardware components entering unknown or indeterminate
states, causing the operating system or hardware
component to cease normal operation. In these cases, the
computer user must restart the computer. Restarting the
computer after a system hang can cause problems such as
data loss and corruption.

Some prior computer systems have included timers
known as "watchdog" timers. A typical watchdog timer
implementation involves a processor periodically resetting a
timer, and under normal operation the timer never reaches a
certain value. If the timer ever reaches the certain value, the
computer system is reset. This solution causes no action to
take place to attempt to cure the malfunction other than to
take the drastic action of resetting the computer system.
Resetting the computer system may result in the same
problems mentioned above with regard to a user restarting a
computer, including data loss and corruption.

Separate error checking processors have been included in
computer systems in order to detect and attempt to recover
from system hangs. This solution has the disadvantage of
being costly. The computer user benefits from less costly
computer systems. Therefore, a lower cost method and
apparatus for detecting and recovering from computer system
malfunctions is desirable.

SUMMARY OF THE INVENTION

A method and apparatus for detecting and recovering
from a computer system malfunction is disclosed. A timer is
periodically reset by a software agent executing on a processor.
If the timer is not reset within a predetermined period of
time, an interrupt is generated. An interrupt handler then
periodically resets the timer, and if the timer is not reset
within an additional predetermined period of time, the
computer system is at least partially reset.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a flow diagram of a method for detecting
and recovering from a computer system malfunction implemented
in accordance with one embodiment of the invention.

FIG. 2 depicts a block diagram of a computer system
implemented in accordance with one embodiment of the
invention.

DETAILED DESCRIPTION

A method and apparatus for detecting and recovering
from computer system malfunctions is disclosed. In the
following description, for the purposes of explanation, specific
details are set forth to provide a thorough understanding
of the invention. However, it will be apparent to one skilled
in the art that these specific details are not required to
practice the invention. In other instances, well known
methods, devices, and structures are not described in
particular detail in order to avoid obscuring the invention.

Overview

The invention solves the problem of detecting and recovering
from computer system malfunctions. In general, and in
accordance with one embodiment of the invention, a timer
is set upon starting the computer. An operating system-related
software agent running on a processor periodically
resets the timer. If the timer ever expires, an interrupt is
generated which causes the processor to execute an interrupt
handler which is unrelated to the operating system. The term
"interrupt" as used herein includes all manner of interrupts,
including, but not limited to, Peripheral Component Interconnect
(PCI) interrupts, Industry Standard Architecture
(ISA) interrupts, System Management Interrupts (SMI), and
Non-Maskable Interrupts (NMI). When the interrupt handler
is called, the timer is reset to its initial value. The interrupt
handler causes the timer to be periodically reset while it
attempts to cure the malfunction that caused the timer to
expire previously. If the timer expires while the interrupt
handler is executing, a partial reset is performed. The partial
reset fully resets the processor and further resets portions of
other system components. The partial reset allows the state
of the various system components to be maintained while
the system is restarted.

Embodiments of the Invention

FIG. 1 shows a flow diagram of a method for detecting
and recovering from a computer system malfunction implemented
in accordance with one embodiment of the invention.
At step 110, a timer is loaded. The timer may be a
count-down timer that is initially loaded with a value and
over a period of time counts down to zero unless it is
reloaded. Other types of timers or counters may also be used
with the invention, including counters that start at a value
and count up until a trigger value is reached. In the present
embodiment, the timer is of the count-down type. The timer
is initially loaded upon system start up as part of the boot
process.

Following the load timer step 110, the timer is checked
after a period of time at step 120 in order to determine
whether the timer has expired. The checking is preferably 
performed by a software agent running on a processor. The 
software agent is typically related to an operating system. If 
the timer has not expired, the software agent causes the timer 
to be reset at step 130. Following step 130, the timer is
checked again after a period of time at step 120. Steps 120 and 
130 are repeated continuously so long as no computer 
system malfunction exists that would prevent the software 
agent from resetting the timer. Malfunctions that would 
prevent the timer from being reset include the operating 
system misbehaving in such a manner that it is unable to 
schedule and run the software agent. Another possible 
malfunction that would prevent the software agent from 
resetting the timer is a broken data or address path between 
the processor and the timer such that even though the 
operating system is behaving properly and the processor is 
able to run the software agent, the processor is not able to 
cause the timer to be reloaded. The processor itself may also 
malfunction in such a manner that it is unable to execute the 
software agent. Other malfunctions are possible, including 
the operating system waiting for a misbehaving peripheral. 
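
The loop formed by steps 110 through 130 can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names CountdownTimer, run_agent, agent_period, and agent_stops_at are hypothetical, time is modeled as discrete ticks, and a real timer would be a hardware counter rather than a Python object.

```python
class CountdownTimer:
    """Hypothetical count-down timer modeled on steps 110-130."""

    def __init__(self, initial_value):
        self.initial_value = initial_value
        self.count = initial_value  # step 110: load the timer

    def tick(self):
        """Advance one time unit; the count never goes below zero."""
        if self.count > 0:
            self.count -= 1

    def expired(self):
        return self.count == 0

    def reset(self):
        """Step 130: reload the timer with its initial value."""
        self.count = self.initial_value


def run_agent(timer, total_ticks, agent_period, agent_stops_at=None):
    """Simulate the software agent; return the tick at which the timer
    expires, or None if it never expires.

    agent_stops_at models a malfunction (an assumption for this sketch):
    from that tick on, the agent is no longer scheduled and therefore
    never resets the timer.
    """
    for now in range(1, total_ticks + 1):
        timer.tick()
        if timer.expired():
            return now
        agent_alive = agent_stops_at is None or now < agent_stops_at
        if agent_alive and now % agent_period == 0:
            timer.reset()  # steps 120 and 130
    return None
```

As long as agent_period is shorter than the initial timer value, a healthy agent keeps the timer from expiring. Once the agent stops running, the timer expires within one initial-value interval of the last reset.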

If the timer does expire, an interrupt is generated at step 
140. In this embodiment, the generated interrupt causes the 
processor to execute an interrupt handler. As mentioned 
above, it is possible that a processor malfunction caused the 
timer to expire. If the processor is not operating properly, it 
likely will not be able to execute the interrupt handler. This 
case is discussed below. The discussion below regarding the 
execution of the interrupt handler assumes that the processor 
is operating in such a manner that it is able to execute the 
handler. 

The interrupt handler is not related to the operating 
system and is stored in non-operating-system memory space. 
Since the interrupt handler is not related to the operating 
system, the processor is able to execute the interrupt handler 
even if the operating system is behaving improperly. The 
interrupt handler attempts to investigate and cure the mal- 
function that allowed the timer to expire. It is possible for the 
interrupt handler to attempt to cure a broad range of possible 
system malfunctions. 

Upon the generation of the interrupt, the timer is reloaded 
at step 150. The reloading is preferably accomplished auto- 
matically by system logic. The processor cannot be relied on 
to perform the reload timer step 150 since a processor 
malfunction may have resulted in the timer expiring. 

The interrupt handler checks the timer to see if it has 
expired a second time at step 160. If the timer has not 
expired, the timer is reset by the interrupt handler at step 
170. Steps 160 and 170 are periodically repeated so long as 
the interrupt handler is executing. If the timer expires a 
second time, it is likely an indication that either the proces- 
sor is unable to execute the interrupt handler or there is a 
broken data or address path between the processor and the 
timer such that even if the processor is able to properly 
execute the interrupt handler the timer is never reset. 

If the timer expires a second time, a system reset occurs 
at step 180. Preferably, the system reset is a partial system 
reset. A partial system reset may involve the processor, the 
memory controller, and portions of system peripherals. The 
partial system reset seeks to retain system state information 
so that the system can attempt to cure system malfunctions 
during the reboot process. An indication is preferably main- 
tained by the system logic that indicates to the system Basic 
Input/Output System (BIOS) that the current boot process 
was triggered by a partial system reset and that steps should 
be taken to investigate and attempt to cure any system 
malfunctions. 
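
The two-stage behavior of steps 140 through 180 can be sketched as a small state machine. This is an illustrative model only, assuming discrete ticks. The class TwoStageWatchdog and its attribute and method names are hypothetical, and handler_done is an assumed way of modeling a handler that cures the malfunction and returns.

```python
class TwoStageWatchdog:
    """Hypothetical model of the system logic in steps 140-180."""

    def __init__(self, initial_value):
        self.initial_value = initial_value
        self.count = initial_value
        self.interrupt_pending = False   # set on the first expiry
        self.partial_reset_flag = False  # set on the second expiry

    def reset_timer(self):
        """Called by the software agent or by the interrupt handler."""
        self.count = self.initial_value

    def tick(self):
        """Advance one time unit and return the event raised, if any."""
        self.count -= 1
        if self.count > 0:
            return None
        # The timer expired. The logic reloads the timer by itself
        # (step 150) because the processor cannot be relied on to do so.
        self.count = self.initial_value
        if not self.interrupt_pending:
            self.interrupt_pending = True
            return "interrupt"       # step 140
        self.partial_reset_flag = True
        return "partial_reset"       # step 180

    def handler_done(self):
        """Assumption: the handler cured the malfunction and returned."""
        self.interrupt_pending = False
```

If nothing resets the timer, the first expiry yields an interrupt and the second yields a partial reset. If the handler keeps calling reset_timer while it works, the second expiry never occurs.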
In an alternative embodiment, the timer is reloaded a 
second time upon the generation of the partial system reset. 
The BIOS periodically resets the timer during the boot 
process and while it attempts to cure any malfunctions. 
Should the timer expire a third time, a more complete system 
reset is performed and the boot process is attempted again. 
The steps of loading the timer, periodically resetting the 
timer during the boot process and while attempting to cure 
the malfunction, and performing a more complete system 
reset can be repeated any number of times. Each time the 
timer expires, more severe actions can be performed in order 
to attempt to cure the malfunction. The most severe action 
might include powering down and then powering up the 
system. 
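
The escalation described in this alternative embodiment can be pictured as an ordered list of recovery actions. The sketch below is hypothetical: the text does not fix how many stages there are or which action belongs to which expiry beyond the first three, so the list RECOVERY_ACTIONS and the function action_for_expiry are assumptions made for illustration.

```python
# Hypothetical ordering of recovery actions, mildest first.
RECOVERY_ACTIONS = [
    "interrupt_handler",  # first expiry
    "partial_reset",      # second expiry
    "full_reset",         # third expiry: a more complete system reset
    "power_cycle",        # most severe action mentioned in the text
]


def action_for_expiry(expiry_count):
    """Return the action for the n-th consecutive timer expiry (1-based).

    Expiries beyond the end of the list repeat the most severe action.
    """
    if expiry_count < 1:
        raise ValueError("expiry_count must be at least 1")
    index = min(expiry_count, len(RECOVERY_ACTIONS)) - 1
    return RECOVERY_ACTIONS[index]
```

Keeping the actions in a single ordered list makes it straightforward to add or remove stages without changing the selection logic.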

FIG. 2 depicts a block diagram of a computer system 200
implemented in accordance with one embodiment of the
invention. The computer system 200 typically includes a
host bus 220 for communicating information, such as
instructions and data. The system further includes a processor
205, coupled to the host bus 220, for processing information
according to programmed instructions, and memory
devices including an operating system-related software
agent storage area 210 and an interrupt handler storage area
215 coupled to the host bus 220 for storing information for
processor 205. The storage area 210 has stored therein a
software agent 212 and the storage area 215 has stored
therein an interrupt handler 217. 

The processor 205 could be an 80960, 386, 486, Pentium®
processor, Pentium® Pro processor, or Pentium® II
processor made by Intel Corp., among others, including
processors that are compatible with those listed above. The
memory devices 210 and 215 may include a random access
memory (RAM) to store dynamic information for processor
205, a read-only memory (ROM) to store static information
and instructions for processor 205, or a combination of both
types of memory. 

An expansion bus bridge 230 couples the host bus 220 to
an expansion bus 240. Devices coupled to the expansion bus
240 include a display device 245, an alphanumeric input
device 250, a BIOS read-only memory 255, and an information
storage device 260 for storing information including
an operating system 262 and applications 264. 

In alternative designs for the computer system 200, information
storage device 260 could be any medium for storage
of computer readable information. Suitable candidates
include a read-only memory (ROM), a hard disk drive, a
disk drive with removable media (e.g., a floppy magnetic
disk or an optical disk), a tape drive with removable
media (e.g., magnetic tape), synchronous DRAM, or a flash
memory (i.e., a disk-like storage device implemented with
flash semiconductor memory). A combination of these, or
other devices that support reading or writing computer
readable media, could be used. 

The display device 245 may be a liquid crystal display, a
cathode ray tube, or any other device suitable for creating
graphic images or alphanumeric characters recognizable to
the user. The alphanumeric input device 250 typically is a
keyboard with alphabetic, numeric, and function keys, but it
may be a touch sensitive screen or other device operable to
input alphabetic or numeric characters. 

The expansion bus bridge 230 includes a timer 232, a
timer initial value register 234, and a partial reset flag 236.
The timer 232, timer initial value register 234, and partial
reset flag 236 are not restricted to being included in the
expansion bus bridge, but may be located elsewhere in the
system. 

Upon system start-up, the timer 232 is loaded with the 
value stored in the timer initial value register 234. The timer 
232 is then periodically reset with the value stored in register 
234 by the software agent 212. The software agent 212 is 
periodically scheduled to execute on the processor by the 
operating system 262. If the timer 232 expires, an interrupt 
signal 224 is asserted to the processor 205. The interrupt 
signal 224 causes the processor to execute the interrupt 
handler 217. Also, when the timer 232 expires the timer 232 
is automatically reloaded with the value stored in register 
234. 

The interrupt handler 217 attempts to investigate and cure 
any system malfunction that resulted in the timer 232 
expiring. Further, while the interrupt handler 217 is execut- 
ing it periodically resets the timer 232 in order to prevent it 
from expiring again. 

If the timer 232 expires a second time, a reset signal 222 
is sent to the processor 205. The reset signal 222 may also be 
communicated to other system devices. The reset signal 222 
causes the processor and possibly other devices to perform 
a partial reset. The partial system reset is discussed above in 
connection with FIG. 1. When the reset signal 222 is 
asserted, the partial reset flag 236 is set. When the 
system restarts as a result of the partial system reset, the 
BIOS (stored in BIOS ROM 255), when executed by the 
processor 205 during the boot process, will cause the partial 
reset flag 236 to be read in order to determine whether a 
partial reset has occurred. If the flag is set, the BIOS will 
attempt to cure any system defects, as discussed above in 
connection with FIG. 1. 
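
The boot-time check of the partial reset flag 236 can be sketched as follows. This is a hypothetical model, not BIOS code: BridgeRegisters stands in for the timer 232, register 234, and flag 236, the diagnose callback stands in for whatever investigation the BIOS performs, and clearing the flag after diagnosis is an assumption that the text does not state.

```python
class BridgeRegisters:
    """Hypothetical stand-in for timer 232, register 234, and flag 236."""

    def __init__(self, initial_value):
        self.timer_initial_value = initial_value  # register 234
        self.timer = initial_value                # timer 232
        self.partial_reset_flag = False           # flag 236


def bios_boot(regs, diagnose):
    """Sketch of the boot path; returns the list of steps taken.

    diagnose is a hypothetical callback representing the attempt to
    investigate and cure malfunctions after a partial reset.
    """
    steps = []
    regs.timer = regs.timer_initial_value  # load the timer on start-up
    steps.append("load_timer")
    if regs.partial_reset_flag:
        steps.append("diagnose")
        diagnose()
        regs.partial_reset_flag = False  # assumption: BIOS clears the flag
    steps.append("boot_os")
    return steps
```

On a normal start-up the flag is clear and the diagnostic step is skipped. After a partial reset the flag is set, so the same boot path runs the diagnostic step before loading the operating system.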

It will be clear to one skilled in the art that the invention 
can operate upon a wide range of programmable computer 
systems, not just the example computer system 200. 

In the foregoing specification the invention has been 
described with reference to specific exemplary embodiments 
thereof. It will, however, be evident that various modifica- 
tions and changes may be made thereto without departing 
from the broader spirit and scope of the invention as set forth 
in the appended claims. The specification and drawings are 
accordingly to be regarded in an illustrative rather than in a 
restrictive sense. 

What is claimed is: 

1. A method, comprising the steps of: 

periodically resetting a timer, the step of resetting the 
timer performed by a software agent executed on a 
processor; 

triggering an interrupt if the timer is not reset within a 
predetermined period of time, the interrupt to indicate 
a malfunction; 

executing an interrupt handler if the interrupt is triggered, 
the interrupt handler to cause the timer to be periodi- 
cally reset, the interrupt handler to function indepen- 
dently of the software agent and the interrupt handler to 
attempt to cure the malfunction; and 

performing at least a partial reset of the computer system 
if the timer is not reset by the interrupt handler within 
an additional predetermined period of time. 



2. The method of claim 1 wherein the step of periodically 
resetting a timer includes executing an operating system-related
software agent on a processor. 

3. The method of claim 1 further comprising the step of 
initially loading the timer with a value stored in a register. 

4. The method of claim 3 wherein the step of periodically 
resetting the timer includes the step of loading the timer with 
a value stored in a register. 

5. The method of claim 4, wherein the step of executing 
an interrupt handler includes the step of loading the timer 
with a value stored in a register. 

6. The method of claim 1 further comprising the step of 
providing an indication accessible to the processor that the 
step of performing at least a partial reset of the computer 
system has been performed. 

7. The method of claim 1 further comprising the step of 
fully resetting the computer system if the step of performing 
at least a partial reset of the computer system is not suc- 
cessful in curing the malfunction. 

8. The method of claim 1 wherein the step of periodically 
resetting a timer includes periodically resetting a first timer, 
the step of triggering an interrupt includes triggering an 
interrupt if the first timer is not reset after a predetermined 
period of time, the step of executing an interrupt handler 
includes causing a second timer to be periodically reset, and 
the step of performing at least a partial reset of the computer 
system includes partially resetting the computer system if 
the second timer is not reset after a predetermined period of 
time. 

9. A computer system, comprising: 
a processor coupled to a bus; 

a first storage area coupled to the bus, the first storage area 
to store an operating system-related software agent that 
when executed by the processor causes a first timer to 
be periodically reset; 

circuitry for signaling an interrupt to the processor when 
the first timer is not reset after a predetermined period 
of time, the interrupt to indicate a malfunction; 

a second storage area coupled to the bus, the second 
storage area to store an interrupt handler that functions 
independently of the operating system-related software 
agent, the interrupt handler to cause a second timer to 
be periodically reset when executed by the processor 
and the interrupt handler to attempt to cure the mal- 
function; and 

circuitry for causing at least a partial system reset when 
the second timer is not reset after a predetermined 
period of time. 

10. The system of claim 9 further comprising a flag for 
indicating that a partial system reset has occurred. 

11. The system of claim 9 wherein the first and second 
timers are implemented as a single, reloadable timer. 

12. The system of claim 11 further including circuitry for 
loading the reloadable timer with a value stored in a register. 